MSDS 7337 - Natural Language Processing

Homework 8 - Topic Modeling

Evangelos Giakoumakis

Library imports

In [1]:
from __future__ import print_function
import platform
import sys
import nltk
import requests
import re
from requests import get
from itertools import repeat
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from nltk.stem.snowball import SnowballStemmer
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, ImageColorGenerator

General system information and library versions

In [2]:
import platform; print(platform.platform())
import sys;print('Python', sys.version)
import os;print('OS', os.name)
import bs4;print('Beautiful Soup', bs4.__version__)
#import urllib;print('Urllib', urllib.request.__version__) 
import re;print('Regex', re.__version__)
import spacy;print('SpaCy', spacy.__version__)
#import gensim;print('Gensim', gensim.__version__)
import sklearn;print('Sklearn', sklearn.__version__)
import scipy;print('Scipy', scipy.__version__)
import matplotlib;print('Matplotlib', matplotlib.__version__)
print(os.environ.get('CONDA_DEFAULT_ENV', 'not running inside a conda environment'))
Windows-10-10.0.17134
Python 2.7.15 |Anaconda, Inc.| (default, May  1 2018, 18:37:09) [MSC v.1500 64 bit (AMD64)]
OS nt
Beautiful Soup 4.6.3
Regex 2.2.1
SpaCy 2.0.12
Sklearn 0.19.1
Scipy 1.1.0
Matplotlib 2.2.2
base

Collection of reviews

Function to parse the first 25 reviews given a movie link

In [3]:
def movie_review_crawler(url):
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    rev_containers = html_soup.find_all('div', class_ = 'text show-more__control')
    reviews = []
    for rv in rev_containers:
        reviews.append(rv.text)
    return reviews

Function to build a list of science fiction reviews: one high-rating and one low-rating review page per film (up to 25 reviews each), about 200 reviews in total. Films used: Matrix, Inception, Sunshine, Dragonball

In [4]:
def my_review_crawler():
        matrix = movie_review_crawler("https://www.imdb.com/title/tt0133093/reviews?sort=helpfulnessScore&dir=desc&ratingFilter=10")
        matrix2 = movie_review_crawler("https://www.imdb.com/title/tt0133093/reviews?sort=helpfulnessScore&dir=desc&ratingFilter=2")
        inception = movie_review_crawler("https://www.imdb.com/title/tt1375666/reviews?sort=helpfulnessScore&dir=desc&ratingFilter=10")
        inception2 = movie_review_crawler("https://www.imdb.com/title/tt1375666/reviews?sort=helpfulnessScore&dir=desc&ratingFilter=2")
        sunshine = movie_review_crawler("https://www.imdb.com/title/tt0448134/reviews?sort=helpfulnessScore&dir=desc&ratingFilter=9")
        sunshine2 = movie_review_crawler("https://www.imdb.com/title/tt0448134/reviews?sort=helpfulnessScore&dir=desc&ratingFilter=2")
        dragonball = movie_review_crawler("https://www.imdb.com/title/tt1098327/reviews?sort=helpfulnessScore&dir=desc&ratingFilter=8")
        dragonball2 = movie_review_crawler("https://www.imdb.com/title/tt1098327/reviews?sort=helpfulnessScore&dir=desc&ratingFilter=1")
        scifi_reviews = matrix + inception + sunshine + dragonball + matrix2 + inception2 + sunshine2 + dragonball2
        return scifi_reviews

Fetch all reviews and store them in a list called reviews

In [5]:
reviews = my_review_crawler()

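Inspect the first 10 reviews collected. Note that the trailing [:50] slices the 10-element list again rather than each string, so the reviews are shown in full.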

In [6]:
reviews[:10][:50]
Out[6]:
[u"The story of a reluctant Christ-like protagonist set against a baroque, MTV backdrop, The Matrix is the definitive hybrid of technical wizardry and contextual excellence that should be the benchmark for all sci-fi films to come.Hollywood has had some problems combining form and matter in the sci-fi genre.  There have been a lot of visually stunning works but nobody cared about the hero. (Or nobody simply cared about anything.)  There a few, though, which aroused interest and intellect but nobody 'ooh'-ed or 'aah'-ed at the special effects.  With The Matrix, both elements are perfectly en sync.  Not only did we want to cheer on the heroes to victory, we wanted them to bludgeon the opposition.  Not only did we sit in awe as Neo evaded those bullets in limbo-rock fashion, we salivated.But what makes The Matrix several cuts above the rest of the films in its genre is that there are simply no loopholes.  The script, written by the Wachowski brothers is intelligent but carefully not geeky.  The kung-fu sequences were deftly shot -- something even Bruce Lee would've been proud of.  The photography was breathtaking.  (I bet if you had to cut every frame on the reel and had it developed and printed, every single frame would stand on its own.)  And the acting?  Maybe not the best Keanu Reeves but name me an actor who has box-office appeal but could portray the uneasy and vulnerable protagonist, Neo, to a T the way Reeves did.  But, come to think of it, if you pit any actor beside Laurence Fishburne, you're bound to confuse that actor for bad acting.  As Morpheus, Mr. Fishburne is simply wicked!  Shades of his mentor-role in Higher Learning, nobody exudes that aura of quiet intensity than Mr. Fishburne.  His character, battle-scarred but always composed Morpheus, is given an extra dose of mortality (He loves Neo to a fault.) only Mr. 
Fishburne can flesh out.People will say what they want to say about how good The Matrix is but the bottomline is this: finally there's a philosophical film that has cut through this generation.  My generation. The Wachowski brothers probably scribbled a little P.S. note when they finished the script saying: THINK FOR A MOMENT ABOUT YOUR EXISTENCE.  What is the Matrix, you ask?  Something that's closer to reality than you think.Either that or it's my personal choice for best film of all-time.",
 u"Without a doubt one of the best and most influential movies of all time, the Matrix is the defining science fiction film of the 1990's and the biggest leap the genre has taken since Stanley Kubrick's 2001: a Space Odyssey and Ridley Scott's Blade Runner. The Matrix is a ground-breaking motion picture that not only raised the bar for all the science-fiction films to come after it but also redefined the action genre with its thrilling action sequences and revolutionary visual effects.The film tells the story of Thomas Anderson a computer hacker that in the world of hacking goes by the alias of Neo. When he is contacted by the mysterious outlaw Morpheus and having always questioned his reality, he is awakened to the truth that the world he's been living in is a simulated reality called the Matrix and that he's nothing more than a slave in this dystopian world, created and controlled by A.I powered machines.The direction and script by the Wachowskis is fantastic, as they drew ideas and inspirations from every other great sci-fi and cyberpunk movie and anime before the film, combining it with stunning action and putting it into one picture that has enough style, substance and subtext that everyone ended up giving their own interpretation of the story. The research that went into the preparation of the screenplay is quite extensive but the manner in which it is presented on the big screen is also very impressive. Every character presented on the film, has a well-defined arc and a purpose, and their motivations are clear.The cinematography is impeccable. It was very innovative in the use of the camera angles and movements, the zooms, the slow motion captures and the different color palette used to differentiate the Matrix and the real World. The editing is flawless, as it makes sure that every scene is integral to the story and ensures the pace of the film stays ferocious through its entire runtime. 
Each frame is also packed with so much visual information for the viewer to devour. The visual effects introduced us to the bullet-time effect and their impact can still be felt in today's movies. The performances are also incredible. Each member of the cast gave their best performances and brought the characters they portray to life, but the one that stands out the most is Hugo Weaving as Agent Smith in undoubtedly the greatest performance in his career.In conclusion The Matrix is a masterpiece everyone should see. It is one of the most thought provoking, inventive, pioneering, influential and stylish movies of all time and it's also full of philosophical and religious allegories. Immortal for its contribution to cinema and pop culture, its brilliant combination of inventive visual effects, excellent vision and exquisite action easily makes it one of the best, most influential and most entertaining movies ever made.",
 u'** May contain spoilers **There aren\'t many movies I watched in the theatre twice \x96 let alone on the same day - but immediately after the credits had rolled (and still pumped up by \'Rage against the Machine\'), I queued up for the next screening of \'The Matrix\'. I was so blown away by that film, I feared - and probably rightly so - that I hadn\'t caught every detail of what I\'d just seen. I later found out that many of my friends had had a similar reaction to the film, and I know virtually no one who liked the film and didn\'t watch it at least twice. It\'s simply one of those rare films that are so rich you just have to watch them several times.In structure, style and concept, \'The Matrix\' was ground-breaking; it marked the first time the visual style of Manga comic books and Anime such as \'Akira\' or \'Ghost in the Shell\' had been successfully translated to a live-action film. Apart from \'Blade Runner\', which has a totally different mood and pace (but is also a masterpiece and visionary film-making), there simply hadn\'t been anything even remotely like it. The jaw-dropping action sequences have such a raw, gripping energy they feel like an adrenalin overdose, but unlike most action films, they never overshadow the story; on the contrary - they enhance it and make complete sense within that universe.As for the story itself, I think this is one of the most original, fascinating Sci-Fi tales you\'ll likely ever see on screen. Clearly inspired by Japanese Anime and Manga yet also by authors like Isaac Asimov or Philip K. Dick, the story about humanity\'s war against its own creation, machines of an artificial intelligence that have evolved to the point where they have become the dominant \'species\' and vastly superior to their creators, could take place in the same world as \'Blade Runner\' or \'The Terminator\' - albeit several hundred years later. 
But there is also a mythical, even religious undercurrent to the story; the themes of a prophecy, a "liberator" or even a "messiah" make \'The Matrix\' transcend the Science-Fiction genre and become even more unique.\'The Matrix\' was a watershed moment in filmmaking \x96 in every respect \x96 and even though two inferior sequels have left a bit of a stain on the film, they can\'t distract from what an uncompromising and hugely influential masterpiece this is. Sci-Fi movies that were released after \'The Matrix\' have tried very hard to achieve a similar look and tone, but the original still owns them all. 10 stars out of 10.Favorite films: http://www.IMDb.com/list/mkjOKvqlSBs/Lesser-known Masterpieces: http://www.imdb.com/list/ls070242495/Favorite TV-Shows reviewed: http://www.imdb.com/list/ls075552387/',
 u'Writing a review of The Matrix is a very hard thing for me to do because this film means a lot to me and therefore I want to do the film justice by writing a good review. To tell the truth the first time I saw the film I was enamored by the effects. I remember thinking to myself that this was one of the most visually stunning films I had ever seen in my life. Also having always been a comic book fan and a fan of films that were larger than life, the transitional element of the story was very appealing to me and this probably heightened my enjoyment of the film very much. It wasn\'t until some time later (and after having seen the film a few times more) that I started to think about the film. I recognized the Christian elements quite quickly but it wasn\'t until I wrote an actual 15-page essay on the film that I tapped into some of the philosophical and religious elements and that made me appreciate the film even more. I won\'t say that I have recognized all elements because the film is quite literally packed with them.Acting wise the film works excellently. I won\'t say that there aren\'t any issues because there are but overall the acting is pretty flawless. Keanu Reeves plays the main character, Neo, or Thomas A. Anderson and while he is not the perfect actor I think he does a pretty good job in The Matrix (and the sequels). He doesn\'t have the longest of lines which was probably a deliberate choice from the directors and it works because this gives him a better opportunity to work on posture and facial expressions and I must say that overall his body language is very good. Very clear and well defined. Laurence Fishbourne plays Neo\'s mentor Morpheus and he does an excellent job of it. His lines flow with a certain confidence and style that makes his character somewhat unique and interesting. Carrie-Anne Moss does a good job as well and succeeds in looking both cool and sexy in her leather outfit. 
Joe Pantoliano, a critically underrated actor does a brilliant job of bringing his character, Cypher, to life. I can\'t say much about him because his character is pretty essential to the plot and I certainly don\'t wan\'t to spoil it for anyone. Gloria Foster appears in a relatively small role that will have greater significance in the following films and she does a very good job. The best acting is provided by Hugo Weaving, however, in his portrayal of Agent Smith. It is really something to watch him act out the changes in his character. Agent Smith gains some human traits like anger, sense of dread, hate and eventually even a sly sense of humor (mostly in the sequels). Two thumbs way up to Weaving who has created one of the finest screen villains of all time.Effects wise the film is simply stunning and it deservedly was awarded the Oscar for best effects (and was regrettably cheated out of a nomination in the Best Film category) ahead of even Star Wars. The reason that I think The Matrix deserves the Oscar for best effects is simply that the effects in The Matrix are more innovative than the ones in Star Wars. Just take a look at how many times the effects have been spoofed and you\'ll probably agree. The effects also help in the symbolism of the film and in creating a very dystopian atmosphere not unlike the one seen in Blade Runner and this works brilliantly. The film looks beautiful at all times and today 6 years later (my God has it already been 6 years?) the effects still hold their ground against new science fiction films. Add the effects to the brilliant editing and you have a visual masterpiece on your hands. Very well done.The reason that I think The Matrix is more reviewable than pretty much any other film is the story and the philosophical and religious elements of the story because with every viewing I catch something I didn\'t see the previous time I watched it. Without spoiling the film I think I can mention a few of the more obvious elements. 
Obviously the film draws on the Messiah myth as Neo is a clear reference to Jesus with the analogy of his name (Neo = one, as in The One) but also hidden in his other name, Thomas A. Anderson. The first part of his last name, Anderson comes from the Greek Andros meaning "man" and combine this with the second part of his last name "son" and add a little creativity you will come up with the combination "son of man" which was a title Jesus came up with about himself. Also the first time we meet Neo a man calls him (and I quote): "You\'re my Saviour man. My own personal Jesus Christ." It doesn\'t get any more obvious than that. Aside from the Christianic elements the film also gets its inspiration from Budhism, Gnosticism (Gnosis = knowledge) but is also inspired by Plato and his analogy of the Cave and Jean Baudrillard\'s essay, Simulacra and Simulations. Explaining these elements would make this review go on forever so aside from mentioning them I will not comment on them further.To all the people who doubt the profound nature of The Matrix I can only give one advice: Free your mind and watch the film again. You won\'t regret it. If I had to choose a favorite all time film my choice would probably fall on either The Matrix (obviously I don\'t expect people to agree but if they do thats great) or The Lord of the Rings: The Return of the King and I recommend it to all fans of sci-fi and people who like philosophy.10/10 - on my top 3 of best films.',
 u'My review of the best epic Science Fiction Action film, The Matrix (1999) starring Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, Joe Pantoliano, Marcus Chong and Gloria Foster.This was be the first movie I went to see in the movie theater with my mom when I was 15.years old, when I read in the magazines about The Matrix I was blown away and I wanted to see it right away. The Matrix is the best action sci-fi films that Keanu Reeves made in the 90\'s. It is one of my personal favorite movies. The Matrix is a (1999) American science fiction action film written and directed by The Wachowskis, starring Keanu Reeves, Laurence Fishburne, Carrie-Anne Moss, Hugo Weaving, and Joe Pantoliano. It depicts a dystopia future in which reality as perceived by most humans is actually a simulated reality called "the Matrix", created by sentient machines to subdue the human population, while their bodies\' heat and electrical activity are used as an energy source. Computer programmer "Neo" learns this truth and is drawn into a rebellion against the machines, which involves other people who have been freed from the "dream world".Written and directed by the Wachowski brothers, this classic sci-fi action film stars Keanu Reeves as a lowly computer programmer who discovers his entire existence, and that of all mankind, is nothing but a simulation in a computer. The reality within a reality is a concept that\'s done before... but never like this - The Matrix requires absolute attention from his audience, least you\'ll be completely lost in a few minutes. With the help of supporting cast members Laurence Fishburne and Carrie-Anne-Moss, Reeves discovers that inside this simulation environment known as The Matrix, he can control, bend, and manipulate space-time... resulting in some of the most incredibly iconic images to ever grace the silver screen. 
The slow-motion "bullet-time" effects as they\'re known today were groundbreaking and revolutionary when we first saw them 12 years ago. Although Reeves is notorious for his inability to really convey much emotional range, his character here lends itself well to him as an actor. This is a movie that makes you think, makes you gasp, and makes totally forget where the 136-minutes went after finishing it. It\'s no wonder this film spawned two very successful sequels, and dozens of copy-cats. The Matrix, "Visually revolutionary, and mind-blowing." Keanu Reeves plays the main character, Neo, or Thomas A. Anderson and while he is not the perfect actor I think he does a pretty good job in The Matrix. He doesn\'t have the longest of lines which was probably a deliberate choice from the directors and it works because this gives him a better opportunity to work on posture and facial expressions and I must say that overall his body language is very good. Very clear and well defined. Laurence Fishbourne plays Neo\'s mentor Morpheus and he does an excellent job of it. His lines flow with a certain confidence and style that makes his character somewhat unique and interesting. Carrie-Anne Moss does a good job as well and succeeds in looking both cool and sexy in her leather outfit. Joe Pantoliano, a critically underrated actor does a brilliant job of bringing his character, Cypher, to life. He also played the roles in Underrated Daredevil (2003)and Bad Boys I & II. I can\'t say much about him because his character is pretty essential to the plot. Gloria Foster appears in a relatively small role that will have greater significance in the following films and she does a very good job. The best acting is provided by Hugo Weaving, however, in his portrayal of Agent Smith. It is really something to watch him act out the changes in his character. Agent Smith gains some human traits like anger, sense of dread, hate and eventually even a sly sense of humor. 
Two thumbs way up to Weaving who has created one of the finest screen villains of all time.Effects wise the film is simply stunning and it deservedly was awarded the Oscar for best effects (and was regrettably cheated out of a nomination in the Best Film category) ahead of even Star Wars. The reason that I think The Matrix deserves the Oscar for best effects is simply that the effects in The Matrix are more innovative than the ones in Star Wars. Just take a look at how many times the effects have been spoofed and you\'ll probably agree. The effects also help in the symbolism of the film and in creating a very dystopia atmosphere not unlike the one seen in Blade Runner and this works brilliantly. The film looks beautiful at all times and today 16 years later (my God has it already been 16 years?) the effects still hold their ground against new science fiction films. Add the effects to the brilliant editing and you have a visual masterpiece on your hands. Very well done.The film also won 4 Academy Awards including for best visual effects. \n10/10 for one of the best epic American science fiction action film\'s in the history movies like this and Aliens (1986) don\'t exist anymore. It is one of my personal favorite movies, it is the movie I saw with my mom in the movie theater it is memories on my mom who is no longer with us anymore and I miss he.',
 u'The Matrix is one of the best sci-fi movies ever made!I read it used to be one of Tarantino\'s favourite ones too, but the second and third ones ruined the whole beautiful idea and brought the overall value of the movie pretty low.The story/philosophy is mind-blowing and the directing, special effects and the overall performance absolutely amazing. Thomas Anderson (Keanu Reeves) is an IT specialist in his day-to-day job and a hacker (alias Neo) during the night. He is looking for Morpheous (Lawrence Fishburn) who could give him the answer to the question \'What is the Matrix?\' The more relevant questions to ask, though, would be "How would you know the difference between the dream world and the real world?" and "How would you know if what is happening right now is not a virtual reality?"The antagonist, Agent Smith (Hugo Weaving), is one of my favourite characters. Just watch his mouth - the way he speaks, the tone and pitch of his voice, his pronunciation, even the pauses he makes while speaking, have an incredible effect on the viewer. The concept of people being like cancer to this world is absolutely astounding.Carrie Ann Moss was the perfect choice for Trinity because any other blond, big lipped, big brested \'bimbo\' would have totally ruined her character.A few more questions pop up in one\'s mind. "Haven\'t we gone too far creating such super intelligent machines?", "What if they become so powerfull one day and take control over humanity?"The acting\'s just amazing, all of them, no exceptions. The 4 leading characters though... (Morphy, Trinity, Neo & Agent Smith) - (yes, to me there are 4 leads) - I love them all. I can\'t imagine the movie with any other 4 actors but L. Fishbourn, C.A. Moss, Keanu and H. Weaving. They were born to play those roles.Whoever did the casting did a terrific job. Bravo!Brilliant, absolutely brilliant!',
 u"The Wachowski brothers really did excel themselves with this movie. It's a brilliant movie on a number of different levels - the directing is excellent, the camera work is great, the visuals are stunning, the kung-fu is A+, acting is executed with style and conviction, and the plot is truly inspired. It's really hard to use enough superlatives on this movie!It'd be a 10/10, except for the ending. Having Neo do what he does at the end really lets it down, in my opinion. However, there's a couple of sequels on the way, so let's see what the Wachowskis can do to make up for it.Other than that, (and like I said above) the movie is operating on so many different levels that each time you watch it, you pick up something new... this isn't by accident, either. The Wachowski brothers had the actors read a number of definitive works (Simulation & Simulcra was one I believe) in modern literature and psychology, and applied liberal dashings of aspects of the major religions to provide the best sci-fi movie of the decade, if not ever.I'm yet to meet somebody who hasn't enjoyed it. It's my favourite movie to watch on a good cinema system, too.",
 u'It\'s been a while since a movie has generated enough interest in me for me to watch it.  "The Matrix" looked exciting enough in the trailers, so I decided to give it a look.  What I found was an amazing movie, with some of the greatest special effects I\'ve ever seen.  The camera angles really work for the action sequences and the choreographed fight scenes made me yearn for more.  Say what you want about Keanu Reeves\' acting.  He may not deliver the best dialogue, but his look can carry a film.  He was a great choice for the role of Neo.  Carrie Anne Moss was great as was the underrated Laurence Fishburne.  I highly recommend this film for those who are a fan of visually stunning movies.  It will blow away your senses...',
 u'The Wachowski Brothers vision of a possible future takes the visual and sound aspects of filmmaking to a new high. Incorporating older still photography with computer enhancement to the degree that appears on the screen has raised the genre to a level that will be very hard-pressed by filmmakers for a number of years. Acting was wonderful, script, visual, sound, everything about this film is a tribute to a usually overlooked genre.',
 u"By the end of the 90s, there hadn't been much in terms of fresh, new sci-fi that we hadn't seen before. Or so I thought.The Matrix combined the best of hard science-fiction with Asian cinema's frenetic and masterfully choreographed action sequences, gorgeous direction, pitch-perfect casting, and the best martial arts I'd ever seen in a Hollywood film.I don't think the sequels did it justice, but The Matrix - as a stand-alone film - remains one of my all time favorites in ANY genre."]
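In cell In [6], `reviews[:10][:50]` applies the second slice to the list, not to the strings, which is why the complete reviews are printed. If a short preview of each review was the intent (an assumption about the goal, not something the notebook states), a list comprehension would do it. The helper name `preview` and the sample strings below are hypothetical stand-ins for the scraped data.

```python
def preview(docs, n_docs=10, n_chars=50):
    """Return the first n_chars characters of each of the first n_docs documents."""
    return [doc[:n_chars] for doc in docs[:n_docs]]

# Hypothetical stand-in for the scraped `reviews` list
sample_reviews = [
    "The Matrix is one of the best sci-fi movies ever made! " * 3,
    "Short review.",
]

previews = preview(sample_reviews)
for p in previews:
    print(p)
```

Slicing a list twice, as in `sample_reviews[:10][:50]`, simply returns the same two full strings again.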

Topic Modeling

Build custom stop word list

In [7]:
from sklearn.feature_extraction import text 
extra_stop_words = ["film", "movie", "just", "going", "story",  "goku", "nolan", "piccolo", "james", "dicaprio",
        "series", "cartoon", "know", "going", "does", "mal" , "didn", "actually", "neo", "cobbs", "boyle", "icarus",
        "make", "things", "page", "job", "haven", "say", "don", "does", "matrix", "sunshine", "dragonball", "inception",
        "movies", "christopher", "gordon-levitt", "joseph", "yamcha", "roshi", "cobb", "michael", "caine", "ellen", "saito", 
        "ariadne", "murphy", "cillian", "dragon", "ball", "hugo", "weaving", "keanu", "reeves", "danny", "hey"]
stopwords = text.ENGLISH_STOP_WORDS.union(extra_stop_words)

Create vectors and clean up data (keep letters and hyphens, remove stop words)

In [8]:
vectorizer = CountVectorizer(min_df=5, max_df=0.9, 
                             stop_words=stopwords, lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(reviews)
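To make the vectorizer settings above more concrete, here is a minimal pure-Python sketch of what the token pattern and the `min_df` / `max_df` thresholds do. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the helper names `tokenize` and `filter_by_document_frequency` and the toy documents are hypothetical, and the thresholds are lowered to fit three documents.

```python
import re

# Same pattern as in the CountVectorizer call: letters or hyphens, at least 3 characters
TOKEN_PATTERN = re.compile(r'[a-zA-Z\-][a-zA-Z\-]{2,}')

def tokenize(doc, stop_words=frozenset()):
    """Lowercase, extract tokens with the pattern, then drop stop words."""
    return [t for t in TOKEN_PATTERN.findall(doc.lower()) if t not in stop_words]

def filter_by_document_frequency(docs_tokens, min_df=5, max_df=0.9):
    """Keep terms found in at least min_df documents (absolute count)
    and in at most max_df * n_docs documents (proportion)."""
    n_docs = len(docs_tokens)
    df = {}
    for tokens in docs_tokens:
        for term in set(tokens):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return sorted(t for t, c in df.items() if min_df <= c <= max_df * n_docs)

tokens = tokenize("It's a sci-fi film, 10/10!", stop_words={"film"})
print(tokens)  # ['sci-fi']

# Toy corpus (hypothetical)
docs = ["Great effects and great plot",
        "The plot was bad",
        "Bad acting, bad plot, great effects"]
vocab = filter_by_document_frequency([tokenize(d) for d in docs], min_df=2, max_df=0.9)
print(vocab)  # ['bad', 'effects', 'great']
```

In the toy corpus, "plot" occurs in every document and is dropped by `max_df`, while words seen in only one document are dropped by `min_df`. Digits and tokens shorter than three characters never match the pattern.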

Information on generated data vector

In [9]:
data_vectorized
Out[9]:
<198x1089 sparse matrix of type '<type 'numpy.int64'>'
	with 13331 stored elements in Compressed Sparse Row format>

We will create 6 topics. The goal of our analysis is to group all reviews into categories that can cover most of the distance map.

In [10]:
NUM_TOPICS = 6

Build a Latent Dirichlet Allocation Model

In [11]:
# Build a Latent Dirichlet Allocation Model
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=20, learning_method='online', random_state=41)
lda_Z = lda_model.fit_transform(data_vectorized)
print("LDA Model:")
print("Shape: ",lda_Z.shape) 
print("Investigate the weights of the first corpus document in each topic space:")
print(lda_Z[0])
LDA Model:
Shape:  (198L, 6L)
Investigate the weights of the first corpus document in each topic space:
[0.00177988 0.00177985 0.00177684 0.00179332 0.52670549 0.46616462]
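The row printed above can be read as a distribution over the six topics: the weights sum to about 1 and the largest entry marks the dominant topic. The short sketch below ranks the topics for that row. The values are copied from the output above; the helper name `rank_topics` is hypothetical.

```python
# Topic weights of the first document, copied from the LDA output above
first_doc = [0.00177988, 0.00177985, 0.00177684, 0.00179332, 0.52670549, 0.46616462]

def rank_topics(weights):
    """Return (topic index, weight) pairs sorted from strongest to weakest."""
    return sorted(enumerate(weights), key=lambda pair: pair[1], reverse=True)

ranked = rank_topics(first_doc)
total = sum(first_doc)
print("sum of weights: %.4f" % total)
print("dominant topic: %d (%.1f%%)" % (ranked[0][0], 100 * ranked[0][1]))
print("runner-up topic: %d (%.1f%%)" % (ranked[1][0], 100 * ranked[1][1]))
```

For this review the mass is split almost entirely between topics 4 and 5, with the other four topics near zero.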

Function to assist us in inspecting the inferred topics

In [12]:
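def print_topics(model, vectorizer, top_n=15):
    # Look up the vocabulary once instead of once per printed term
    feature_names = vectorizer.get_feature_names()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort is ascending, so walk it backwards to get the top_n largest weights
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])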
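The slice `topic.argsort()[:-top_n - 1:-1]` in `print_topics` is compact but easy to misread. The sketch below reproduces the same indexing with plain Python lists so it can be followed step by step: sort the indices by ascending weight, then read the last `top_n` of them in reverse. The helper name `top_indices` and the toy weights are hypothetical.

```python
def top_indices(weights, top_n):
    """Indices of the top_n largest weights, largest first."""
    # Plain-Python stand-in for numpy's argsort: indices ordered by ascending weight
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Start at the end, step backwards, stop before position -top_n - 1
    return ascending[:-top_n - 1:-1]

toy_topic = [0.1, 5.0, 2.0, 3.5]  # one weight per vocabulary term
top2 = top_indices(toy_topic, 2)
print(top2)  # [1, 3]
```

Asking for as many indices as there are terms returns the full ordering from largest to smallest.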

Inspect LDA model inferred topics

In [13]:
print("LDA Model Topics:")
print_topics(lda_model, vectorizer)
LDA Model Topics:
Topic 0:
[(u'crew', 23.782521226791783), (u'sun', 21.2927910596878), (u'space', 20.043239244932128), (u'danny', 15.479991711885523), (u'mission', 11.905406909923759), (u'capa', 10.061916201727385), (u'later', 8.535442329116231), (u'bomb', 8.045593481644401), (u'mace', 6.509637041306173), (u'evans', 6.152153403636083), (u'like', 6.0766243319761655), (u'ship', 6.02749555324673), (u'plays', 5.997168940063968), (u'suit', 5.923218519572942), (u'chris', 5.860276472288744)]
Topic 1:
[(u'idea', 3.134804553067741), (u'blockbuster', 3.1273941063663573), (u'cinema', 2.772592979653157), (u'script', 2.4628448021051166), (u'beautiful', 2.349917091768036), (u'performance', 2.3410104400862584), (u'ideas', 2.3028835500674814), (u'films', 2.17351910291112), (u'shutter', 2.155439996160868), (u'island', 2.1467577400435136), (u'spectacle', 1.965838905369298), (u'play', 1.8085213677336995), (u'audience', 1.7405197496580576), (u'plant', 1.7352473464144484), (u'piece', 1.6518985602925536)]
Topic 2:
[(u'like', 5.613285150784102), (u'hey', 4.49744118980215), (u'characters', 3.4989370593685543), (u'mind', 3.4845765520931424), (u'memento', 3.1419847443830737), (u'work', 3.116491624670875), (u'words', 3.105210473390925), (u'shouldn', 3.098443367008532), (u'thing', 3.0855116030705876), (u'act', 3.0788565979205016), (u'liked', 3.066490165626107), (u'train', 3.060457575735968), (u'complete', 2.6152684633367045), (u'acting', 2.497146195410836), (u'beginning', 2.4652703928876547)]
Topic 3:
[(u'best', 18.769612346900114), (u'dark', 12.687918031950534), (u'action', 12.63392697352488), (u'plot', 11.315489682000372), (u'time', 11.03571152113231), (u'knight', 10.999331612318247), (u'mind', 10.376641898120214), (u'complex', 9.436252628723818), (u'memento', 8.302965229880533), (u'audience', 8.05484146922981), (u'batman', 7.427353658106022), (u'cast', 7.373052608743046), (u'hardy', 7.285679079443403), (u'idea', 7.141177991775658), (u'tom', 7.063052174635041)]
Topic 4:
[(u'like', 218.44282625010112), (u'characters', 123.44412104888697), (u'people', 115.00962580057363), (u'really', 107.41257016487589), (u'good', 100.26362935411788), (u'time', 95.3617743665186), (u'think', 94.74896479092378), (u'bad', 88.61071792026576), (u'did', 87.81986400545057), (u'effects', 85.10818214984147), (u'great', 84.61524687525072), (u'way', 83.90658895917905), (u'action', 77.08796949644118), (u'plot', 75.6053095773464), (u'sun', 72.69664255600313)]
Topic 5:
[(u'dream', 81.89914517322012), (u'world', 36.4806828946509), (u'effects', 35.5156184122598), (u'time', 34.78871664178316), (u'reality', 34.683921015054345), (u'dreams', 32.20426512694232), (u'best', 31.90570307082335), (u'like', 25.19311097925555), (u'idea', 25.085241551266037), (u'dreaming', 22.043598066209704), (u'character', 20.47595736113575), (u'mind', 20.03843026221371), (u'action', 19.941961415405885), (u'people', 19.84801834957171), (u'life', 18.65638959455922)]

Testing with an unknown document

In [14]:
test = "Watching The Matrix Reloaded, one is absolutely entitled to say that it is overloaded, too lengthy action sequences for instance, and indeed, a way too lengthy dancing scene in Zion. But next to that, it is obvious that this sequal to The Matrix (1999)takes the story to a whole new dimension. Different characters define the working of the matrix, and the meaning of life itself, in different ways, depending on their onthological background. A conclusion is not (yet) given, which adds to the movie a kind of postmodern quality. For as far as the action sequences are concerned: Groundbreaking. You'll see stuff that you've never seen before. Sometimes the scenes are a little lengthy, which harmes the narrative, but that is compensated easily by the visual spectacle. And yes, the Architect at the end is difficult to understand, but when you watch the film more than once, you'll find out that it does make sense what he says. All together this movie may not be as fantastic as 'The Matrix', but it is definitely a good movie that will keep you thinking for a while."

x = lda_model.transform(vectorizer.transform([test]))[0]
print(x)
[0.003718   0.03170116 0.00372229 0.00373553 0.95336823 0.0037548 ]
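
The vector above is the topic distribution LDA assigns to the unseen review. A minimal sketch of how it might be read, assuming the printed values are copied by hand into a plain list (the names `lda_weights` and `dominant_topic` are illustrative and not part of the notebook):

```python
from __future__ import print_function

# Topic distribution printed above for the unseen review (copied from the output).
lda_weights = [0.003718, 0.03170116, 0.00372229, 0.00373553, 0.95336823, 0.0037548]

# The index of the largest weight is the topic the model considers dominant.
dominant_topic = max(range(len(lda_weights)), key=lambda i: lda_weights[i])

print("Dominant topic:", dominant_topic)
print("Weight: %.3f" % lda_weights[dominant_topic])

# LDA returns a probability distribution, so the weights should sum to about 1.
print("Sum of weights: %.3f" % sum(lda_weights))
```

Inside the notebook the same lookup could be done directly on `x`, for example with `x.argmax()`.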

Build a Non-Negative Matrix Factorization Model

In [15]:
# Build a Non-Negative Matrix Factorization Model
nmf_model = NMF(n_components=NUM_TOPICS, random_state=77)
nmf_Z = nmf_model.fit_transform(data_vectorized)
print("NMF Model:")
print("Shape: ",nmf_Z.shape) 
print("Investigate the weights of the first corpus document in each topic space:")
print(nmf_Z[0])
NMF Model:
Shape:  (198L, 6L)
Investigate the weights of the first corpus document in each topic space:
[0.         0.         0.         0.73114997 0.21088511 0.3494002 ]
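
Unlike the LDA output, NMF weights are not probabilities and do not sum to 1. A small sketch, assuming we only want to compare relative topic shares within one document, rescales the row printed above (the names `nmf_weights` and `nmf_shares` are illustrative and not part of the notebook):

```python
from __future__ import print_function

# NMF weights of the first corpus document (copied from the output above).
nmf_weights = [0.0, 0.0, 0.0, 0.73114997, 0.21088511, 0.3494002]

# Divide by the row total so the shares sum to 1; guard against an all-zero row.
total = sum(nmf_weights)
nmf_shares = [w / total for w in nmf_weights] if total > 0 else list(nmf_weights)

print("Raw sum: %.4f" % total)
print("Normalized shares:", ["%.3f" % s for s in nmf_shares])
```

This rescaling is only a convenience for reading one row; it does not turn NMF into a probabilistic model.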

Inspect NMF model inferred topics

In [16]:
print("NMF Model Topics:")
print_topics(nmf_model, vectorizer)
NMF Model Topics:
Topic 0:
[(u'like', 3.883134744216298), (u'people', 2.044360540753905), (u'good', 1.4570999951462107), (u'action', 1.3028330292708146), (u'think', 1.0652294401700644), (u'really', 0.9726618560644266), (u'original', 0.9410660636552226), (u'look', 0.8057047624068409), (u'time', 0.7775685841475836), (u'seen', 0.7658495669302319), (u'way', 0.7294031495316643), (u'based', 0.7176372015067002), (u'better', 0.6925511553642099), (u'anime', 0.6907590611565193), (u'manga', 0.6745741539538328)]
Topic 1:
[(u'dream', 3.3929235977733874), (u'idea', 1.9420815824774837), (u'mind', 1.5226777355110162), (u'world', 1.3452553550900421), (u'dreams', 1.236968733232978), (u'reality', 0.9170668120028558), (u'time', 0.8839838327117113), (u'children', 0.7862816057339274), (u'real', 0.7213300100742887), (u'dreaming', 0.7173276940036859), (u'ideas', 0.646665760790749), (u'people', 0.5673783994924602), (u'subconscious', 0.5466341939247431), (u'wife', 0.5277293831277577), (u'head', 0.5252375399542784)]
Topic 2:
[(u'sun', 2.871185187639535), (u'crew', 1.830553470901554), (u'ship', 1.1933364544094118), (u'space', 0.9308766698818415), (u'mission', 0.8990011012056722), (u'science', 0.7627313136101969), (u'earth', 0.7209762624196228), (u'garland', 0.6997360458624511), (u'sci-fi', 0.620383373123417), (u'life', 0.5993059189503538), (u'characters', 0.597331866850502), (u'star', 0.5371088404021279), (u'really', 0.5319034647333868), (u'light', 0.5247952590726883), (u'way', 0.4924934341490592)]
Topic 3:
[(u'best', 2.1411889974080083), (u'effects', 2.106582751678501), (u'character', 1.2994349007472277), (u'action', 1.106929934996667), (u'films', 1.0459586279216089), (u'time', 1.0204789155922143), (u'good', 1.0022689641501052), (u'think', 0.9881529734605432), (u'makes', 0.7687528659935056), (u'elements', 0.7352073551603333), (u'pretty', 0.667677734011367), (u'times', 0.6485288127734306), (u'seen', 0.6341868737900137), (u'years', 0.633363000216926), (u'science', 0.6109191853294511)]
Topic 4:
[(u'characters', 2.073395899460106), (u'special', 1.971751623850461), (u'effects', 1.9699782312920662), (u'good', 0.9200303172661818), (u'great', 0.8942835287207404), (u'bad', 0.7968253727404507), (u'think', 0.7207774376618445), (u'care', 0.6243802626537839), (u'world', 0.587083016498772), (u'long', 0.5726824218796035), (u'cgi', 0.5698994319480514), (u'rating', 0.5219131918112067), (u'really', 0.47941579268288365), (u'people', 0.47406317298006784), (u'time', 0.4664980692951951)]
Topic 5:
[(u'like', 1.9803752055892367), (u'did', 1.790786969058529), (u'bad', 1.1035450268041398), (u'looks', 1.0255101668984348), (u'casting', 0.8605436021104286), (u'effect', 0.7998651548366158), (u'work', 0.6775075212015731), (u'doesn', 0.6749373067964147), (u'wrong', 0.6534639974751703), (u'place', 0.5702533285342001), (u'budget', 0.5500809658683411), (u'believe', 0.5480977893443805), (u'visual', 0.5169555533882988), (u'end', 0.49789314974082627), (u'cut', 0.48282625606526064)]

Testing the NMF model with an unknown document

In [17]:
test = "Watching The Matrix Reloaded, one is absolutely entitled to say that it is overloaded, too lengthy action sequences for instance, and indeed, a way too lengthy dancing scene in Zion. But next to that, it is obvious that this sequal to The Matrix (1999)takes the story to a whole new dimension. Different characters define the working of the matrix, and the meaning of life itself, in different ways, depending on their onthological background. A conclusion is not (yet) given, which adds to the movie a kind of postmodern quality. For as far as the action sequences are concerned: Groundbreaking. You'll see stuff that you've never seen before. Sometimes the scenes are a little lengthy, which harmes the narrative, but that is compensated easily by the visual spectacle. And yes, the Architect at the end is difficult to understand, but when you watch the film more than once, you'll find out that it does make sense what he says. All together this movie may not be as fantastic as 'The Matrix', but it is definitely a good movie that will keep you thinking for a while."

x = nmf_model.transform(vectorizer.transform([test]))[0]
print(x)
[0.2723074  0.         0.01375711 0.00339458 0.         0.08016057
 0.27420886 0.01238357]

Visualizations

Visualization for LDA Model using pyLDAvis

In [17]:
import pyLDAvis.sklearn
 
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, vectorizer, mds='tsne')
panel
C:\Users\Evan\Anaconda2\lib\site-packages\pyLDAvis\_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False

  return pd.concat([default_term_info] + list(topic_dfs))
Out[17]:

Visualization for NMF Model using pyLDAvis

In [18]:
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(nmf_model, data_vectorized, vectorizer, mds='tsne')
panel
C:\Users\Evan\Anaconda2\lib\site-packages\pyLDAvis\_prepare.py:223: RuntimeWarning: divide by zero encountered in log
  kernel = (topic_given_term * np.log((topic_given_term.T / topic_proportion).T))
C:\Users\Evan\Anaconda2\lib\site-packages\pyLDAvis\_prepare.py:240: RuntimeWarning: divide by zero encountered in log
  log_lift = np.log(topic_term_dists / term_proportion)
C:\Users\Evan\Anaconda2\lib\site-packages\pyLDAvis\_prepare.py:241: RuntimeWarning: divide by zero encountered in log
  log_ttd = np.log(topic_term_dists)
Out[18]:
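
The warnings above are raised where `np.log` is applied to the topic-term values. NMF factors, unlike LDA's, can contain exact zeros, which is the most likely cause. One common workaround, which the notebook does not apply, is to add a tiny constant before normalizing each topic row. The sketch below uses a hypothetical two-topic, three-term matrix rather than the fitted `nmf_model`, and the helper name and `eps` value are assumptions:

```python
from __future__ import print_function
import math

# Hypothetical topic-term weights; the exact zeros mimic what NMF can produce.
components = [
    [2.0, 0.0, 1.0],
    [0.0, 3.0, 1.0],
]

def smoothed_distributions(rows, eps=1e-12):
    """Add a small constant to every weight, then normalize each row to sum to 1."""
    result = []
    for row in rows:
        shifted = [w + eps for w in row]
        total = sum(shifted)
        result.append([w / total for w in shifted])
    return result

topic_term = smoothed_distributions(components)

# Every entry is now strictly positive, so the logarithm is finite everywhere.
log_topic_term = [[math.log(p) for p in row] for row in topic_term]
print(log_topic_term)
```

In the notebook, smoothed rows like these could in principle be passed to the generic `pyLDAvis.prepare` function, which accepts explicit topic-term and document-topic distributions. That route is untested here.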

Conclusion

As expected, NMF produced the best results. In the intertopic distance maps above, the NMF topics are almost equally weighted, clearly separated, and spread across most of the plane. The top words per topic also show that NMF produced better and more distinct topics. LDA performed worse: its topics had uneven weights, some of them sat very close to each other, and together they covered a smaller area of the plane. This is understandable, since LDA generally needs a large dataset to work well.

LDA Topic Results

In [19]:
# LDA Results

# Hand-written summary of the top words of each LDA topic
lda_res = [
    "Topic 0 - team perfect dangerous world",
    "Topic 1 - effects action best world science",
    "Topic 2 - sequence philosophical speed answer",
    "Topic 3 - dream people like idea mind reality",
    "Topic 4 - narrative sound visual machine vision",
    "Topic 5 - like time sun think way action",
]

print("LDA Topic Descriptions: ")
lda_res
LDA Topic Descriptions: 
Out[19]:
['Topic 0 - team perfect dangerous world',
 'Topic 1 - effects action best world science',
 'Topic 2 - sequence philosophical speed answer',
 'Topic 3 - dream people like idea mind reality',
 'Topic 4 - narrative sound visual machine vision',
 'Topic 5 - like time sun think way action']

NMF Topic Results

In [20]:
# NMF Results

# Hand-written summary of the top words of each NMF topic
nmf_res = [
    "Topic 0 - like people action original anime",
    "Topic 1 - dream idea mind reality time",
    "Topic 2 - sun crew space ship mission earth",
    "Topic 3 - best action time think elements",
    "Topic 4 - special effects great world care",
    "Topic 5 - like bad looks effect work place",
]

print("NMF Topic Descriptions: ")
nmf_res
NMF Topic Descriptions: 
Out[20]:
['Topic 0 - like people action original anime',
 'Topic 1 - dream idea mind reality time',
 'Topic 2 - sun crew space ship mission earth',
 'Topic 3 - best action time think elements',
 'Topic 4 - special effects great world care',
 'Topic 5 - like bad looks effect work place']
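
One rough way to put a number on how distinct the topics are is to count the words that appear in more than one topic summary. The sketch below applies this to the two hand-written summaries above. The helper `shared_words` is an illustrative name, not part of the notebook. The summaries are short, so this is only a coarse check; running the same function over the full top-15 word lists of each model would be more informative.

```python
from __future__ import print_function
from collections import Counter

# Topic summaries copied from the two result cells above (without the labels).
lda_topics = [
    "team perfect dangerous world",
    "effects action best world science",
    "sequence philosophical speed answer",
    "dream people like idea mind reality",
    "narrative sound visual machine vision",
    "like time sun think way action",
]
nmf_topics = [
    "like people action original anime",
    "dream idea mind reality time",
    "sun crew space ship mission earth",
    "best action time think elements",
    "special effects great world care",
    "like bad looks effect work place",
]

def shared_words(topic_summaries):
    """Return, sorted, the words that occur in more than one topic summary."""
    counts = Counter()
    for summary in topic_summaries:
        # Use a set so a word is counted at most once per topic.
        counts.update(set(summary.split()))
    return sorted(word for word, n in counts.items() if n > 1)

print("LDA shared words:", shared_words(lda_topics))
print("NMF shared words:", shared_words(nmf_topics))
```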

Wordcloud

Wordcloud of each review

In [49]:
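# Draw one word cloud per review for the first 100 reviews
for review in reviews[:100]:
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(review)
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")  # hide the axes, they carry no information for an image
    plt.show()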